Annotation and Issues in Building an English Dependency Treebank
Abstract
The Paninian Grammar framework, formulated by Panini for his analysis of Sanskrit, is finding extensive application to languages other than Sanskrit about two thousand five hundred years after its formulation. The work presented in this paper is one such application: it extends Paninian Grammar (PG, or CPG for Computational Paninian Grammar) to English, a fixed word order language. It shows how CPG can account for English and makes available a linguistically rich resource in the form of an English Dependency Treebank. At present, 2000 sentences have been annotated as part of this effort using the Hyderabad Dependency Treebank (HyDT) annotation scheme for Indian languages, which is modelled on CPG. In this paper we describe CPG and the annotation scheme used for this work. We then discuss the annotation of English data under the scheme and how its application to English differs from its application to Hindi. Finally, we discuss our handling of some English constructions, and some anomalies in the language that pose a challenge to applying this annotation scheme to English as is.
Similar resources
An annotation scheme for Persian based on Autonomous Phrases Theory and Universal Dependencies
A treebank is a corpus with linguistic annotations above the level of the parts of speech. During the first half of the present decade, three treebanks have been developed for Persian either originally or subsequently based on dependency grammar: Persian Treebank (PerTreeBank), Persian Syntactic Dependency Treebank, and Uppsala Persian Dependency Treebank (UPDT). The syntactic analysis of a sen...
Annotation Procedure in Building the Prague Czech-English Dependency Treebank
In this paper, we present some organizational aspects of building a large corpus with rich linguistic annotation, using the Prague Czech-English Dependency Treebank (PCEDT) as an example. We stress the necessity of dividing the annotation process into several well-planned phases. We present a system for automatically checking the correctness of the annotation and describe several ways to measu...
Towards Building Parallel Dependency Treebanks: Intra-Chunk Expansion and Alignment for English Dependency Treebank
The paper presents our work on the annotation of intra-chunk dependencies on an English treebank that was previously annotated with inter-chunk dependencies, and for which there exists a fully expanded parallel Hindi dependency treebank. This provides fully parsed dependency trees for the English treebank. We also report an analysis of the inter-annotator agreement for this chunk expansion task...
(Pre-)Annotation of Topic-Focus Articulation in Prague Czech-English Dependency Treebank
The objective of the present contribution is to give a survey of the annotation of information structure in the Czech part of the Prague Czech-English Dependency Treebank. We report on this first step in the process of building a parallel annotation of information structure in this corpus, and elaborate on the automatic pre-annotation procedure for the Czech part. The results of the pre-annotat...
Building the multilingual TUT parallel treebank
The paper introduces an ongoing project for the development of a parallel treebank for Italian, English and French annotated in the pure dependency format of the Turin University Treebank, i.e. Parallel–TUT. We hypothesize that the major features of this annotation format can be of some help in addressing the typical issues related to parallel corpora, e.g. alignment at various levels. Therefor...